docs: 📝 pseudo code and docstring for `write_resource_parquet()` #816

lwjohnst86 · 2024-10-25T02:32:29Z

Description

Based on @martonvago's suggestion, I'll write things in "pseudocode" from now on. But instead of actual pseudocode, I will write an outline of the Python function showing how I think it might flow inside. Plus, I can write the full docstrings inside, so we don't need to move them over from the Quarto doc later. I have NOT run this, tested it, or executed it in any way; this is purely how I think it might work, hence "pseudo" 😛. I'll add some comments directly to the code in the PR.

Closes #642

This PR needs an in-depth review.

Checklist

Updated documentation

docs/design/implementation/python-functions.qmd

sprout/core/write_resource_parquet.py

martonvago

Very nice!! Just some questions.

I think I've developed some confusion about what the raw data files represent. Are they different versions of the data (with later versions overwriting earlier ones) or different sections of the data (e.g. one file for rows 1-100 and another one for rows 101-200)? Well, I guess there is no reason why they couldn't be used as both...

sprout/core/write_resource_parquet.py

lwjohnst86 · 2024-10-29T13:02:05Z

@martonvago I forgot to respond to your initial question.

Raw files are kept from the initial upload to keep a record just in case something happens.

One potential scenario: a first round of surveys is sent to people and that data gets uploaded to Sprout. That's one raw file. Maybe a few months later, the same survey is sent out again and that data gets uploaded. That's another raw file. Those two raw files then get merged together and saved as the `data.parquet` file, which is the file that researchers actually use for analyses.
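To make the scenario concrete, here is a minimal sketch of stacking two survey rounds in upload order. It is not Sprout's implementation: the CSV format, the column names, and the helpers `read_raw_file()` and `merge_raw_files()` are all assumptions made for illustration, and the step that writes the merged rows to `data.parquet` is left out.

```python
import csv
import io

# Hypothetical contents of two raw files, one per survey round.
RAW_ROUND_1 = "id,answer\n1,yes\n2,no\n"
RAW_ROUND_2 = "id,answer\n3,yes\n4,yes\n"


def read_raw_file(text: str) -> list[dict[str, str]]:
    """Parse one raw CSV file into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def merge_raw_files(raw_files: list[str]) -> list[dict[str, str]]:
    """Stack the rows of all raw files in upload order (oldest first)."""
    merged: list[dict[str, str]] = []
    for text in raw_files:
        merged.extend(read_raw_file(text))
    return merged


# The merged rows are what would end up in the single data file.
merged_rows = merge_raw_files([RAW_ROUND_1, RAW_ROUND_2])
```

Keeping the raw files untouched and rebuilding the merged file from them means the merged file can always be regenerated if something goes wrong.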

signekb

The overall picture of this makes sense to me as well 👍

sprout/core/write_resource_parquet.py

lwjohnst86 · 2025-02-13T10:47:47Z

docs/design/interface/functions.qmd

@@ -127,13 +127,21 @@ flowchart
    function --> out
 ```

-### {{< var wip >}} `write_resource_parquet(raw_files, path)`
+### {{< var wip >}} `build_resource_parquet(raw_files_path, resource_properties)`


I'm unsure of the naming here. I'm also unsure whether it should output a DataFrame and leave the writing to a separate `write_resource_parquet()` function.


lwjohnst86 · 2025-02-13T10:49:43Z

docs/design/interface/pseudocode/build_resource_parquet.py

+    While Sprout generally assumes
+    that the files stored in the `resources/raw/` folder have already been
+    verified and validated, this function does some quick verification checks
+    of the data after reading it into Python from the raw file(s) by comparing
+    with the current properties given by the `resource_properties`. All data in the


Suggested change

-    While Sprout generally assumes
-    that the files stored in the `resources/raw/` folder have already been
-    verified and validated, this function does some quick verification checks
-    of the data after reading it into Python from the raw file(s) by comparing
-    with the current properties given by the `resource_properties`. All data in the
+    While Sprout generally assumes
+    that the files stored in the `resources/raw/` folder are already correctly
+    structured and tidy, it still runs checks to ensure the data are correct
+    by comparing to the properties. All data in the

martonvago

This looks very sensible to me 😁
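For readers following along, the kind of check the docstring describes (comparing the data read from the raw files against the resource properties) might look roughly like the sketch below. The layout of `RESOURCE_PROPERTIES`, the `PYTHON_TYPES` mapping, and the `check_rows()` helper are hypothetical; the PR does not specify which checks are run.

```python
from typing import Any

# Hypothetical properties, loosely following a Data Package-style layout.
RESOURCE_PROPERTIES = {
    "name": "survey",
    "schema": {
        "fields": [
            {"name": "id", "type": "integer"},
            {"name": "answer", "type": "string"},
        ]
    },
}

# Map of property types to Python types (only the two used in this sketch).
PYTHON_TYPES = {"integer": int, "string": str}


def check_rows(rows: list[dict[str, Any]], properties: dict[str, Any]) -> list[str]:
    """Return a list of readable problems; an empty list means the checks passed."""
    fields = properties["schema"]["fields"]
    expected_names = [field["name"] for field in fields]
    problems = []
    for number, row in enumerate(rows, start=1):
        # First check that the columns match the properties exactly.
        if list(row) != expected_names:
            problems.append(f"row {number}: columns {list(row)} != {expected_names}")
            continue
        # Then check that each value has the declared type.
        for field in fields:
            value = row[field["name"]]
            if not isinstance(value, PYTHON_TYPES[field["type"]]):
                problems.append(
                    f"row {number}: '{field['name']}' is not of type {field['type']}"
                )
    return problems


good_rows = [{"id": 1, "answer": "yes"}]
bad_rows = [{"id": "one", "answer": "yes"}, {"id": 2}]
```

Returning a list of problems rather than raising on the first one is only a design choice of this sketch; it lets a caller report every mismatch at once.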

martonvago · 2025-02-13T11:54:25Z

docs/design/interface/functions.qmd

@@ -127,13 +127,21 @@ flowchart
    function --> out
 ```

-### {{< var wip >}} `write_resource_parquet(raw_files, path)`
+### {{< var wip >}} `build_resource_parquet(raw_files_path, resource_properties)`


Hmm, if that DataFrame output is used somewhere else, have two functions; otherwise, have one function that does the writing as well?
`build` or `create` sounds okay to me.
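The two options being weighed can be sketched side by side. Everything here is hypothetical: the function names are stand-ins rather than the names proposed in the PR, rows are plain dictionaries instead of a DataFrame, and JSON replaces Parquet only so that the sketch runs with the standard library.

```python
import json
from pathlib import Path


def build_resource_rows(raw_files: list[list[dict]]) -> list[dict]:
    """Build step: combine the rows of all raw files and return them in memory."""
    return [row for raw_file in raw_files for row in raw_file]


def write_resource_rows(rows: list[dict], path: Path) -> Path:
    """Write step: save the built rows to disk and return the path."""
    # JSON is only a stand-in here; the real function would write Parquet.
    path.write_text(json.dumps(rows))
    return path


def build_and_write_resource_rows(raw_files: list[list[dict]], path: Path) -> Path:
    """Single-function variant: build and write in one call."""
    return write_resource_rows(build_resource_rows(raw_files), path)
```

With the split version, the in-memory result of the build step can be reused elsewhere (for example in tests) without touching the file system; the combined version is simpler if nothing else needs that intermediate output.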

martonvago · 2025-02-13T11:58:01Z

docs/design/interface/pseudocode/build_resource_parquet.py

+
+    If there are any duplicate observation units in the data, only the most recent
+    observation unit will be kept. This way, if there are any errors or mistakes
+    in older raw files that has been corrected in later files, the mistake can still


Suggested change

-    in older raw files that has been corrected in later files, the mistake can still
+    in older raw files that have been corrected in later files, the mistake can still
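The "keep only the most recent observation unit" rule from this docstring could be implemented along these lines. This is a sketch under assumptions: the rows are ordered oldest raw file first, an `id` column identifies the observation unit, and `drop_older_duplicates()` is a hypothetical helper, not a function from the PR.

```python
def drop_older_duplicates(rows: list[dict], key: str) -> list[dict]:
    """Keep only the last (most recent) row for each value of `key`.

    Assumes `rows` is ordered from the oldest raw file to the newest.
    """
    latest: dict = {}
    for row in rows:
        # A later row with the same key overwrites the earlier one.
        latest[row[key]] = row
    return list(latest.values())


rows = [
    {"id": 1, "answer": "yse"},  # typo in the first raw file
    {"id": 2, "answer": "no"},
    {"id": 1, "answer": "yes"},  # corrected in a later raw file
]
deduplicated = drop_older_duplicates(rows, key="id")
```

Because a dictionary keeps the position of a key's first insertion, the result preserves the order in which observation units first appeared while carrying their latest values.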

martonvago · 2025-02-13T11:59:49Z

docs/design/interface/pseudocode/build_resource_parquet.py

+        sp.write_resource_parquet(
+            raw_files_path=sp.path_resources_raw_files(1, 1),
+            parquet_path=sp.path_resource_data(1, 1),
+            properties_path=sp.path_package_properties(1, 1),
+        )


Does this need to be updated?

docs: 📝 pseudo code and docstring for write_resource_parquet()

9355eb9

Closes #642

lwjohnst86 requested a review from a team as a code owner October 25, 2024 02:32

github-actions bot assigned lwjohnst86 Oct 25, 2024

lwjohnst86 commented Oct 25, 2024

martonvago reviewed Oct 25, 2024

lwjohnst86 commented Oct 28, 2024

signekb reviewed Nov 5, 2024

lwjohnst86 added 2 commits November 11, 2024 13:45

Merge branch 'main' of https://github.com/seedcase-project/seedcase-sprout into docs/write-resource-parquet-pseudocode

07ef602

chore: 🚚 move file into pseudocode folder

b316ae5

lwjohnst86 marked this pull request as draft November 13, 2024 14:13

lwjohnst86 added 4 commits February 13, 2025 08:59

Merge branch 'main' of https://github.com/seedcase-project/seedcase-sprout into docs/write-resource-parquet-pseudocode

a9040a2

docs: 📝 add Mermaid diagram and rename to build_

63a8ed4

docs: 🏗️ updated and finished pseudocode for build_resource_parquet()

fa28b98

chore: 🚚 rename and move to interface/

fa590b6

lwjohnst86 marked this pull request as ready for review February 13, 2025 08:23

lwjohnst86 requested review from signekb and martonvago February 13, 2025 08:23

lwjohnst86 commented Feb 13, 2025

martonvago requested changes Feb 13, 2025
